Data creation

Collection, spreadsheet design and digitizing

Aud Halbritter and Joe Chipperfield

Introduction

Data creation is the systematic collection of data for a specific purpose, followed by digitizing the data so that they can be used in statistical analysis, shared with others, or reused.

What is the plan?

Time Task
30 min Data collection - design method
30 min Go outside and collect data
30 min Discussion on data collection and digitizing

Data collection

Data collection methods

  • automatic data collection (e.g. weather station, LiCor)

  • manual data collection (samples, measure, count)

Exercise: design method (15 min)

  • Make groups of 4-5 students.

  • Decide what data you want to collect. (e.g. snow depth, length of icicles, plant height)

  • Decide on a method to collect the data (e.g. paper, phone).

  • Design a spreadsheet/protocol. Think about which information you need to record.

Reflect on the decisions you made and whether you would change anything if you had the means.

Discussion: collect data (10 min)

  • What decisions did you make?
  • What are the pros and cons of your method?

Exercise: collect your data (30 min)

  • Go outside and collect your data. It is not important to collect as many data points as possible.

  • Reflect on whether the method you used was suitable for the data you collected.

Discussion: data collection (10 min)

  • How did it go?

  • Was the method appropriate?

  • Did you miss any information?

Key things during data collection

  • Logistical issues

  • Calibration of instruments

  • Multiple measurements/observations/samples

  • Template/protocol for sampling (multiple data collectors, over time)

  • Take notes during data collection

  • Collect metadata that could be useful for wider usage

Source: britishecologicalsociety.org/publications/guides-to

Digitizing and processing data

Discussion: digitizing (15 min)

  • What would be your strategy for digitizing the data?

  • What could be problems?

Data validation tools

Use data validation tools for data entry.

  • format cells (dates)

  • set ranges

  • drop-down menus
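
The same three checks can also be run in code after data entry. The following is a minimal sketch, not a prescribed tool: the column names (date, site, snow_depth_cm), the allowed sites and the depth range are hypothetical and should be replaced by the rules in your own protocol.

```python
from datetime import date

# Hypothetical rules for an example snow-depth sheet (assumptions, not from the course).
ALLOWED_SITES = {"A", "B", "C"}    # plays the role of a drop-down menu
DEPTH_RANGE_CM = (0.0, 500.0)      # plays the role of a set range


def validate_row(row):
    """Return a list of problems found in one data-entry row (a dict of strings)."""
    problems = []

    # Format check: the date must parse as an ISO 8601 date such as 2024-01-15.
    try:
        date.fromisoformat(row["date"])
    except ValueError:
        problems.append("date is not an ISO 8601 date")

    # Drop-down check: only known site codes are accepted.
    if row["site"] not in ALLOWED_SITES:
        problems.append("unknown site")

    # Range check: the value must be a number inside the plausible range.
    try:
        depth = float(row["snow_depth_cm"])
    except ValueError:
        problems.append("snow depth is not a number")
    else:
        low, high = DEPTH_RANGE_CM
        if not low <= depth <= high:
            problems.append("snow depth out of range")

    return problems
```

Calling validate_row on every row and printing the non-empty results gives a list of entries to check against the paper sheets.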

Be consistent

  • file names

  • variable names

  • factor levels

  • missing data

  • notes
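
Consistency can be partly enforced in code. The sketch below assumes, purely for illustration, that variable names should be lower-case snake_case, that factor levels should be lower-case, and that all missing values should be written as NA; the list of missing-data codes is hypothetical.

```python
# Hypothetical codes that different data collectors might use for missing values.
MISSING_CODES = {"", "na", "n/a", "nan", "-", "missing"}


def standardize_value(text):
    """Trim whitespace, lower-case the text and map any missing code to 'NA'."""
    cleaned = text.strip().lower()
    return "NA" if cleaned in MISSING_CODES else cleaned


def standardize_name(name):
    """Turn a column header into a lower-case snake_case variable name."""
    return "_".join(name.strip().lower().replace("-", " ").split())
```

For example, standardize_name turns the headers "Snow Depth cm" and "Plant-Height" into snow_depth_cm and plant_height, and standardize_value maps "N/A", "-" and empty cells to the same NA code.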

Be careful with dates
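
One reason dates need care is that a value such as 03/04/2024 can mean 3 April or 4 March, depending on who typed it. A minimal sketch of one way to handle this is to state the input format explicitly and store the result as YYYY-MM-DD. The function name and the default day/month/year format are assumptions for illustration.

```python
from datetime import datetime


def to_iso_date(text, input_format="%d/%m/%Y"):
    """Parse a date written in a known format and return it as YYYY-MM-DD."""
    return datetime.strptime(text.strip(), input_format).date().isoformat()


# The same string gives different dates depending on the assumed format.
day_first = to_iso_date("03/04/2024")                # read as day/month/year
month_first = to_iso_date("03/04/2024", "%m/%d/%Y")  # read as month/day/year
```

Here day_first is "2024-04-03" and month_first is "2024-03-04", which is why the format used on the data sheet should be written down in the protocol.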

Data workflow

  • Digitize data

  • Keep raw data raw

  • No calculations in raw data

  • Code-based data cleaning

  • Clean data

  • Document your data (data about your data)
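
The workflow above can be sketched in a few lines of code: the raw file is only read, all cleaning and calculations happen in the script, and the result is written to a separate clean file. The column name snow_depth_cm and the derived column snow_depth_m are hypothetical examples, not part of the course material.

```python
import csv


def clean_rows(rows):
    """Return cleaned copies of the rows; the input rows are left untouched."""
    cleaned = []
    for row in rows:
        # Tidy the column names and strip stray whitespace from the values.
        new_row = {key.strip().lower(): value.strip() for key, value in row.items()}
        # Derived values are calculated here, never in the raw file.
        new_row["snow_depth_m"] = str(float(new_row["snow_depth_cm"]) / 100)
        cleaned.append(new_row)
    return cleaned


def clean_file(raw_path, clean_path):
    """Read the raw CSV (which is never written to) and save a cleaned copy."""
    with open(raw_path, newline="", encoding="utf-8") as raw_file:
        rows = list(csv.DictReader(raw_file))
    cleaned = clean_rows(rows)
    if not cleaned:
        return  # nothing to write for an empty raw file
    with open(clean_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=list(cleaned[0].keys()))
        writer.writeheader()
        writer.writerows(cleaned)
```

Because every change is made by the script, the cleaning can be rerun from the untouched raw file at any time, and the script itself documents what was done to the data.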

Literature

  • BES guide for Data management

  • BioStats book Data collection

  • Broman, Karl W, and Kara H Woo. 2018. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10.